Reuse of a Proper Noun Recognition System in Commercial and Operational NLP Applications

Abstract

SRA's proprietary product, NameTag™, which provides fast and accurate name recognition, has been reused in many applications in recent and ongoing efforts, including multilingual information retrieval and browsing, text clustering, and assistance to manual text indexing. This paper reports on SRA's experience in embedding name recognition in these three specific applications, and the mutual impacts that occur, both on the algorithmic level and in the role that name recognition plays in user interaction with a system. In the course of this, we touch upon various interactions between proper name recognition and machine translation (MT), as well as the role of accurate name recognition in improving the performance of word segmentation algorithms needed for languages whose writing systems do not segment words.

1 Introduction

Fast and accurate name recognition products are only now coming onto the market. SRA's proprietary product, NameTag, has been reused in many applications in recent and ongoing efforts, including multilingual information retrieval and browsing, text clustering, and assistance to manual text indexing. In the following paper, we report on our experience in embedding name recognition in these three specific applications, as well as the mutual impacts that occur, both on the algorithmic level and in the role that name recognition plays in user interaction with a system. In the course of this, we touch upon various interactions between proper name recognition and machine translation (MT), as well as the role of accurate name recognition in improving the performance of word segmentation algorithms needed for languages such as Japanese.

Name recognition clearly offers added value when integrated with other algorithms and systems, but the latter also affect the way in which name recognition is performed, specifically the choice of high-recall or high-precision strategies. But first, we discuss the relevant features of NameTag.
2 Description of NameTag

NameTag is a multilingual name recognition system. It finds and disambiguates in texts the names of people, organizations, and places, as well as time and numeric expressions, with very high accuracy. The design of the system makes possible the dynamic recognition of names: NameTag does not rely on long lists of known names. Instead, NameTag makes use of a flexible pattern specification language to identify novel names that have not been encountered previously. In addition, NameTag can recognize and link variants of names in the same document automatically. For instance, it can link "IBM" to "International Business Machines" and "President Clinton" to "Bill Clinton."

NameTag incorporates a language-independent C++ pattern-matching engine along with the language-specific lexicons, patterns, and other resources necessary for each language. In addition, the Japanese, Chinese, and Thai versions integrate word segmenters to deal with the orthographic challenges of these languages. (NameTag currently has these language versions available plus ones for English, Spanish, and French.)

NameTag is an extremely fast and robust system that can be easily integrated with other applications through its API. In our experience, NameTag has lent itself to so many successful integrations in diverse applications not just because of its accuracy but also because of its speed. (Its NT version is currently benchmarked at 300 megabytes/hour on a Pentium Pro.) It is an attractive package to embed in an application, as it does not significantly slow the application down.

In the following discussion, we refer to various versions of NameTag, most prominently systems for English and Japanese. Their extraction accuracy varies.
For example, in the Sixth Message Understanding Conference (MUC-6), the English system was benchmarked against the Wall Street Journal blind test set for the name tagging task, and achieved a 96% F-measure, which is a combination of recall and precision measures. Our internal testing of the Japanese system against blind test sets of various Japanese newspaper articles indicates that it achieves accuracy ranging from the high 80s to the low 90s in percent, depending on the type of corpus.

Indexing names in Japanese texts is usually more challenging than in English texts for two main reasons. First, there is no case distinction in Japanese, whereas English names in newspapers are capitalized, and capitalization is a very strong clue for English name tagging. Second, Japanese words are not separated by spaces and therefore must be segmented into separate words before the name tagging process. As segmentation is not 100% accurate, segmentation errors can sometimes cause name tagging rules not to fire or to misfire.

3 Proper Name Recognition Integrated With a Browsing & Retrieval System

We have recently developed a system incorporating NameTag that allows monolingual users to access information on the World Wide Web in languages that they do not know (Aone, Charocopos, and Gorlinsky, 1997). For example, previously it was not easy for a monolingual English speaker to locate necessary information written in Japanese. The user would not know the query terms in Japanese even if the search engine accepted Japanese queries. In addition, even when the users located a possibly relevant text in Japanese, they would have little idea about what was in the text. The output of off-the-shelf machine translation (MT) systems is often of low quality, and even "high-end" MT systems have problems, particularly in translating proper names and specialized domain terms, which often contain the information most critical to the users.
Now these users have available our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of name recognition software as embodied in NameTag to improve the accuracy of cross-linguistic retrieval and to provide innovative methods to browse and explore multilingual document collections. The system indexes texts in different languages (currently English and Japanese) and allows the users to retrieve relevant texts in their native language (currently English). The retrieved text is then presented to the users with proper names and specialized domain terms translated and hyperlinked. Among the innovations in our system is the stress placed upon proper names and their role as indices for document content.

The system consists of an Indexing Module, a Client Module, and a Term Translation Module. The Indexing Module creates and inserts indices into a database, while the Client Module allows browsing and retrieval of information in the database through a Web-browser-based graphical user interface (GUI). The Term Translation Module dynamically translates English user queries into Japanese and the indexed terms in retrieved Japanese documents into English.

The Indexing Module

For the present application, the system indexes names of people, entities, and locations, as well as scientific and technical (S&T) terms in both English and Japanese texts, and allows the user to query and browse the indexed database in English. As NameTag processes texts, the indexed terms are stored in a relational database with their semantic type information (person, entity, place, S&T term) and alias information, along with such metadata as source, date, language, and frequency information.
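The kind of index record described here can be pictured with a small relational table. The following is only a minimal sketch using Python's built-in sqlite3 module; the table name, column names, and helper functions are illustrative assumptions and are not taken from the actual system, whose schema the text does not give.

```python
import sqlite3


def create_index_db(conn):
    """Create a hypothetical table holding one row per (term, document) pair."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS indexed_terms (
            term          TEXT NOT NULL,
            semantic_type TEXT NOT NULL
                CHECK (semantic_type IN ('person', 'entity', 'place', 'st_term')),
            alias_of      TEXT,              -- fuller form this term is a variant of
            doc_id        TEXT NOT NULL,
            source        TEXT,
            doc_date      TEXT,
            language      TEXT NOT NULL,     -- e.g. 'en' or 'ja'
            frequency     INTEGER NOT NULL DEFAULT 1,
            PRIMARY KEY (term, doc_id)
        )
        """
    )


def add_term(conn, term, semantic_type, doc_id, language,
             alias_of=None, source=None, doc_date=None):
    """Record one occurrence of a term; repeated occurrences bump the frequency."""
    cur = conn.execute(
        "UPDATE indexed_terms SET frequency = frequency + 1 "
        "WHERE term = ? AND doc_id = ?",
        (term, doc_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO indexed_terms "
            "(term, semantic_type, alias_of, doc_id, source, doc_date, language) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (term, semantic_type, alias_of, doc_id, source, doc_date, language),
        )


def documents_for(conn, term):
    """Return the ids of documents indexed under a term or under one of its aliases."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM indexed_terms "
        "WHERE term = ? OR alias_of = ? ORDER BY doc_id",
        (term, term),
    )
    return [row[0] for row in rows]
```

Storing the alias link alongside each term means that a lookup for "International Business Machines" can also return documents that only mention "IBM", which is one plausible way to use the alias information the text mentions.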
The Client Module

The Client Module lets the user both retrieve and browse information in the database through the Web-browser-based GUI. In the query mode, a form-based Boolean query issued by a user is automatically translated into an SQL query, and the English terms in the query are sent to the Term Translation Module. The Client Module then retrieves documents which match either the original English query or the translated Japanese query. As the indices are names and terms which may consist of multiple words (e.g., "Warren Christopher," "memory chip"), the query terms are delimited in separate boxes in the form, so that no ambiguity arises in either translation or retrieval.

In its browsing mode, the Client Module allows the user to browse the information in the database in various ways. For example, once the user selects a particular document for viewing, the client sends it to an appropriate (i.e., English or Japanese) indexing server for creating hyperlinks for the indexed terms, and, in the case of a Japanese document, sends the indexed terms to the Term Translation Module to translate the Japanese terms into English. The result that the user browses is a document in which each indexed term is hyperlinked to other documents containing the same indexed term. Since hyperlinking is based on the original or translated English terms, the monolingual English speaker can follow the links to both English and Japanese documents transparently. In addition, the Client Module is integrated with a commercial MT system for a rough translation of the whole text.

The Term Translation Module

The Term Translation Module is used by the Client Module bi-directionally in two different modes. That is, it translates English query terms into Japanese in the query mode and, in reverse, translates Japanese indexed terms into English for viewing of a retrieved Japanese text in the browsing mode.
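The query-mode flow described above (separate term boxes, translation of each English term, then retrieval on either the English or the Japanese form) can be sketched as follows. This is a hedged illustration only: the glossary entries, the function names, and the `indexed_terms` table with `doc_id` and `term` columns are all assumptions made for the example, since the text does not describe the actual translation resources or SQL.

```python
# Hypothetical bilingual glossary standing in for the Term Translation Module.
EN_TO_JA = {
    "Warren Christopher": "ウォーレン・クリストファー",
    "memory chip": "メモリチップ",
}
JA_TO_EN = {ja: en for en, ja in EN_TO_JA.items()}


def translate_term(term, direction):
    """Translate one term; direction is 'en-ja' (query mode) or 'ja-en' (browsing mode)."""
    if direction == "en-ja":
        return EN_TO_JA.get(term)
    if direction == "ja-en":
        return JA_TO_EN.get(term)
    raise ValueError("direction must be 'en-ja' or 'ja-en'")


def build_query(terms, operator="AND"):
    """Turn the term boxes of a Boolean form into a parameterized SQL query.

    Each box holds one complete, possibly multi-word term, so a term is never
    split on spaces. A document satisfies a box if it is indexed under the
    English term or under its Japanese translation.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not terms:
        raise ValueError("at least one query term is required")

    clauses, params = [], []
    for term in terms:
        variants = [term]
        translated = translate_term(term, "en-ja")
        if translated is not None:
            variants.append(translated)
        placeholders = ", ".join("?" for _ in variants)
        clauses.append(
            "doc_id IN (SELECT doc_id FROM indexed_terms "
            f"WHERE term IN ({placeholders}))"
        )
        params.extend(variants)

    sql = "SELECT DISTINCT doc_id FROM indexed_terms WHERE " + f" {operator} ".join(clauses)
    return sql, params
```

Passing the terms as bound parameters rather than pasting them into the SQL string keeps multi-word and non-ASCII terms intact and avoids quoting problems.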
3.1 Issues Concerning Proper Name Recognition for Browsing and Retrieval

Based on the system description in the preceding sections, we now describe in more detail the impact of name recognition on multilingual browsing and retrieval.

3.1.1 Indexing Accuracy

To index, the system uses two different configurations of NameTag for English and Japanese. Indexing of names is particularly significant in the Japanese case, where the accuracy of indexing depends on the accuracy of segmentation of a sentence. In English, since words are separated by spaces, there is no issue of indexing accuracy for individual words. However, in languages such as Japanese, where word boundaries are not explicitly marked by spaces, word segmentation is necessary to


Publication date: 2002